KFLUXBUGS-863: introduce results fixes with requisite performance related tuning (dev/stage) #3616
Conversation
Forgot to mention, @enarha @savitaashture @khrm @hugares @stuartwdouglas @sayan-biswas @avinal @ramessesii2: per https://go.dev/doc/diagnostics#profiling and the "Can I profile my production services?" section, I am considering just enabling profiling, but without CPU profiling, in case we ever have to exec into the api server or watcher and get a goroutine dump. If you have opinions there, please advise. I plan on discussing this in the arch call as well.
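For anyone who wants to see what that enablement looks like in a Go service, here is a minimal sketch (illustrative only, not the actual Results api server/watcher wiring; the localhost:6060 address is an assumption). Importing net/http/pprof registers the /debug/pprof/* handlers, and no CPU profile is collected unless /debug/pprof/profile is explicitly requested, so exposing goroutine dumps stays cheap.

```go
// Minimal sketch: expose pprof on a side port so a goroutine dump can be
// pulled on demand. This is not the Results code; it only illustrates the
// standard-library mechanism being discussed.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
	go func() {
		// localhost:6060 is a placeholder address for illustration.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the service's normal work
}
```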
Force-pushed f2d9e2a to 81271be
You probably already saw, @pmacik, but the e2es are clean and this at least deploys now.
/test appstudio-e2e-tests
Force-pushed 81271be to e4e903c
Still green @enarha @pmacik ... going to pull this in through openshift-pipelines/pipeline-service#994.
…to one to match prod

Since we had to change the threading model for log storage in order to fix the results watcher memory leak, rather than just doing the normal pipeline-service bump to pick up the new results images, I have picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and watcher. Aside from the memory leak, the worker thread and k8s client tuning have proven to greatly improve controller latency with the core tekton controller and chains controller. The hope is this will help the watcher as well, both when storing and when pruning.

Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica startup timing.

rh-pre-commit.version: 2.2.0 rh-pre-commit.check-secrets: ENABLED
Force-pushed e4e903c to d821cc9
@gabemontero Just to understand the end goal here: do we want to eventually promote these changes to prod, or is the intention to use them only in stage?
Eventually these results fixes / tuning absolutely need to go into prod, @enarha. I'm just being super careful given the relative fragility of results. Ideally, we complete the perf testing @pmacik agreed to take on.
OK, still green ... going to try some local results testing.
Works for me locally as well, @enarha @pmacik ... none of the x509 cert errors that @pmacik saw a couple of weeks ago on his test cluster. I'm going to re-read https://issues.redhat.com/browse/KONFLUX-2096 and see if I can run his test against my cluster while he is busy with RHDH.
OK, e2es pass even with the enablement of pprof in the watcher (see my Apr 30 Konflux community presentation https://docs.google.com/document/d/18Rt4VQh3MH7_O_FdkpmRp8-PacOyGPq4WCxT4Jm-ZD0/edit#heading=h.jpnstykbavio on why we are doing this, at least for the interim). Testing locally to confirm basic results still works, as well as being able to get a thread dump if needed.
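On the "thread dump if needed" point: once the pprof port is reachable (for example via a kubectl port-forward to the watcher pod), the goroutine dump is just an HTTP GET. A small sketch, reusing the placeholder localhost:6060 address from the earlier example:

```go
// Hypothetical helper that saves a full goroutine dump to a file. The
// debug=2 form returns complete stacks for every goroutine.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	resp, err := http.Get("http://localhost:6060/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("goroutines.txt")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		panic(err)
	}
}
```

The same endpoint also works directly from a browser or curl.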
Local testing looks good: see https://issues.redhat.com/browse/KFLUXBUGS-863?focusedId=24697433&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-24697433. If Pavel cannot get to S3-based testing soon, I may move forward with this.
Force-pushed 24aa628 to ce3b40a
/test appstudio-hac-e2e-tests
I've also been gathering staging data to compare with after this merges. Even on the much less heavily used stage envs, for example, I'm seeing the results watcher queue depth go over 10. My stress testing on my personal cluster can equal or beat that.
Created 1000 simple pipelineruns (each with 2 taskruns) starting just after 4 PM US Eastern Friday May 10 on stage-m01, over a 50 minute timespan, in my tenant (yes, I pushed up against the …). Observations:

I'll get a comparison on stage-m01 after this PR merges. Aside from node size, minio vs. S3 is a key difference.
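For context on the shape of that load, a small client-go program along these lines can generate the runs (a rough sketch, not the actual test script; the namespace, the referenced Pipeline name, and the counts are placeholders, and a simple two-task Pipeline is assumed to already exist in the tenant):

```go
// Sketch of generating many simple PipelineRuns for a stress test, assuming
// kubeconfig access to the target cluster. All names and counts below are
// placeholders, not the parameters actually used for the stage-m01 test.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{Group: "tekton.dev", Version: "v1", Resource: "pipelineruns"}

	for i := 0; i < 1000; i++ {
		pr := &unstructured.Unstructured{Object: map[string]interface{}{
			"apiVersion": "tekton.dev/v1",
			"kind":       "PipelineRun",
			"metadata":   map[string]interface{}{"generateName": "stress-"},
			"spec": map[string]interface{}{
				// Placeholder: a pre-existing Pipeline with two simple tasks.
				"pipelineRef": map[string]interface{}{"name": "simple-two-task-pipeline"},
			},
		}}
		if _, err := client.Resource(gvr).Namespace("my-tenant").Create(context.TODO(), pr, metav1.CreateOptions{}); err != nil {
			fmt.Println("create failed:", err)
		}
	}
}
```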
[APPROVALNOTIFIER] This PR is APPROVED

Approval requirements bypassed by manually added approval. This pull-request has been approved by: gabemontero

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
/hold cancel
Revert "KFLUXBUGS-863: introduce results fixes with requisite performance related tuning (dev/stage) (redhat-appstudio#3616)"

This reverts commit 8b62b98.
So the minio vs. S3 difference had far more of the opposite effect than anticipated ... log storage was much slower with the simple stress test going to S3 than it had been with the same code/settings going to minio. I'm about done gathering some debug data, and then will revert this PR with #3688. After I have had a chance to analyze the data, I'll report back on next steps: more threads, larger timeouts, or whether it is time for a total redesign or a pivot to kubearchive.
This replaces #3594 for dev and stage
I've initially placed a hold on this, as I want to get sign-off from multiple sources before merging.
Our second attempt at fixing the watcher memory leak. This also addresses intermittent failures to store logs because of a cancelled context.
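I won't reproduce the Results change itself here, but for reviewers unfamiliar with the failure mode, the general pattern for "a cancelled request context aborts log storage" looks like the sketch below: give the upload its own context that keeps the request's values but not its cancellation, bounded by a timeout the storage path owns. This is illustrative only; it assumes Go 1.21+ for context.WithoutCancel, and the function names are made up.

```go
// Illustrative pattern only (not the actual Results change): detach log
// storage from the caller's context so a cancelled reconcile request does
// not abort an in-flight upload, while still bounding the work with its
// own timeout.
package main

import (
	"context"
	"fmt"
	"time"
)

// storeLogs stands in for the real log-upload routine.
func storeLogs(ctx context.Context, name string) error {
	select {
	case <-time.After(200 * time.Millisecond): // simulated upload
		fmt.Println("stored logs for", name)
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	reqCtx, cancel := context.WithCancel(context.Background())
	cancel() // the incoming request is already cancelled

	// Keep the request's values but drop its cancellation, then apply a
	// timeout owned by the storage path itself.
	storeCtx, done := context.WithTimeout(context.WithoutCancel(reqCtx), 30*time.Second)
	defer done()

	if err := storeLogs(storeCtx, "pipelinerun-abc"); err != nil {
		fmt.Println("storage failed:", err)
	}
}
```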
This also brings in the ability to profile the api server with pprof, though it is not enabled at this time.
Since we had to change the threading model for log storage in order to fix the results watcher memory leak, rather than just the normal pipeline-service bump to pick up the new results images, I have picked those commits and am combining them with our initial increase in k8s client tuning and worker threads for both the api server and watcher.
I AM AVOIDING DATABASE LEVEL TUNING FOR NOW, SAVING THAT FOR A SEPARATE EFFORT
Aside from adjusting for the memory leak fix, the worker thread and k8s client tuning have proven to greatly improve controller latency with the core tekton controller and chains controller. The hope is this will help the watcher as well, both when storing and when pruning.
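For reviewers not familiar with the knobs involved, the sketch below shows the two kinds of settings being raised, with placeholder values rather than the ones shipped in this PR: the client-go QPS/Burst rate limits (the defaults of 5/10 throttle a busy controller) and the per-controller worker count.

```go
// Illustrative only: the values are placeholders, not the settings in this
// PR. Raising client-go QPS/Burst reduces client-side throttling latency;
// more reconcile workers drain the queue faster at the cost of more
// concurrent API calls.
package main

import (
	"fmt"

	"k8s.io/client-go/rest"
)

func main() {
	cfg := &rest.Config{Host: "https://api.example-cluster:6443"}

	// client-go defaults are QPS=5, Burst=10, which a busy controller can
	// easily saturate.
	cfg.QPS = 50
	cfg.Burst = 100

	// Worker count is a separate, per-controller setting (e.g. the
	// "threads per controller" value in knative-style controllers).
	workers := 16

	fmt.Printf("QPS=%v Burst=%d workers=%d\n", cfg.QPS, cfg.Burst, workers)
}
```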
Also, during the results outage, I saw some semblance of leader-election-based HA going on, where prior testing had only ever shown backup/restore behavior. Until sharding can be properly tested, I don't want to chance enablement based on replica startup timing.
Lastly, I want to start moving away from https://github.com/openshift-pipelines/pipeline-service/.
As such, I'm replicating the api server config map generator entirely here. If this becomes a point of contention that I cannot convince others about, I'll create a PR afterward to add the config.env deltas to the pipeline-service repo, but I don't want to delay this change more than needed.
@enarha @savitaashture @khrm @hugares @stuartwdouglas PTAL
@sayan-biswas @avinal @ramessesii2 .... given the importance of this change, I would ask you all PTAL as well when you have cycles. Thanks.